First, we load a few R packages
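The setup chunk itself was not preserved; a minimal sketch of the likely package loads (the exact list is an assumption) is:

```r
# Likely packages for this lecture (assumed list):
library(tidyverse)    # dplyr, ggplot2, tidyr, stringr, ...
library(tidytext)     # tidy text mining tools
library(janeaustenr)  # text of Jane Austen's six novels
```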

Attribution: A lot of the material for this lecture came from the following resources

Motivation

Analyzing text data such as Twitter content, books or news articles is commonly performed in data science.

In this lecture, we will be asking the following questions:

Which are the most commonly used words from Jane Austen’s novels? Which are the most positive or negative words? How does the sentiment (e.g. positive vs negative) of the text change across each novel?

image source

To answer these questions, we will need to learn about a few things. Specifically,

  1. How to convert words in documents to a tidy text format using the tidytext R package
  2. A little bit about sentiment analysis

Learning how to make text data tidy

Throughout this course, we have learned about the tidy data principles and the tidyverse R packages as a way to make handling data easier and more effective. These packages depend on data being formatted in a particular way. The idea with tidy text is to treat text as data frames of individual words and apply the same tidy data principles to make text mining tasks easier and consistent with already developed tools.

What is the tidy text format?

When dealing with text data, the tidy text format is defined as a table with one-token-per-row, where a token is a meaningful unit of text (e.g. a word, pair of words, sentence, paragraph, etc). Given a chosen set of tokens, we can tokenize the text, i.e., split it into the defined tokens of interest along the rows. We will learn more about how to do this using functions in the tidytext R package.

In contrast, here are other data structures commonly used to store text data in text mining applications:

  • string: text can, of course, be stored as strings, i.e., character vectors, within R, and often text data is first read into memory in this form.
  • corpus: these types of objects typically contain raw strings annotated with additional metadata and details.
  • document-term matrix: This is a sparse matrix describing a collection (i.e., a corpus) of documents, with one row for each document and one column for each term. The value in the matrix is typically the word count.

I won’t describe these other formats in greater detail, but I encourage you to read about them if you are interested in this topic.

Why is this format useful?

One of the biggest advantages of transforming text data to the tidy text format is that it allows data to transition smoothly between other packages that adhere to the tidyverse framework (e.g. ggplot2, dplyr, etc).

A flowchart of a typical text analysis that uses tidytext for sentiment analysis.


image source

In addition, a user can work in the tidy text format for, e.g., data visualization with ggplot2, and then convert the data to other data structures (e.g. a document-term matrix) that are commonly used in machine learning applications.

How does it work?

The main workhorse function in the tidytext R package to tokenize text data is the unnest_tokens(data, output, input) function.

In addition to the data frame (data), the function needs two basic arguments:

  1. output or the output column name that will be created as the text is unnested into it
  2. input or input column name that the text comes from

Let’s try out the unnest_tokens() function using the first paragraph in the preface of Roger’s R Programming for Data Science book.

## [1] "I started using R in 1998 when I was a college undergraduate working on my senior thesis."                                                                                                        
## [2] "The version was 0.63."                                                                                                                                                                            
## [3] "I was an applied mathematics major with a statistics concentration and I was working with Dr. Nicolas Hengartner on an analysis of word frequencies in classic texts (Shakespeare, Milton, etc.)."
## [4] "The idea was to see if we could identify the authorship of each of the texts based on how frequently they used certain words."                                                                    
## [5] "We downloaded the data from Project Gutenberg and used some basic linear discriminant analysis for the modeling."                                                                                 
## [6] "The work was eventually published and was my first ever peer-reviewed publication."                                                                                                               
## [7] "I guess you could argue it was my first real 'data science' experience."

Turns out Roger performed a similar analysis as an undergraduate student! He goes on to say that back then no one was using R (but a little bit of something called S-PLUS), so I can only imagine how different it was to accomplish a task like the one we are going to do today compared to when he was an undergraduate.
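The data frame printed below pairs each sentence with its line number; a sketch, using the first two sentences as a stand-in for the full preface (the variable name peng_preface_df is taken from the text below):

```r
library(dplyr)

# Stand-in for the full preface: the seven sentences printed above
peng_preface <- c(
  "I started using R in 1998 when I was a college undergraduate working on my senior thesis.",
  "The version was 0.63."
)
peng_preface_df <- tibble(line = seq_along(peng_preface),
                          text = peng_preface)
peng_preface_df
```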

## # A tibble: 7 x 2
##    line text                                                              
##   <int> <chr>                                                             
## 1     1 I started using R in 1998 when I was a college undergraduate work…
## 2     2 The version was 0.63.                                             
## 3     3 I was an applied mathematics major with a statistics concentratio…
## 4     4 The idea was to see if we could identify the authorship of each o…
## 5     5 We downloaded the data from Project Gutenberg and used some basic…
## 6     6 The work was eventually published and was my first ever peer-revi…
## 7     7 I guess you could argue it was my first real 'data science' exper…

Text Mining and Tokens

Next we will use the unnest_tokens() function where we will call the output column to be created word and the input column text from the peng_preface_df.
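A sketch of that call, using a one-sentence stand-in for peng_preface_df so the chunk is self-contained:

```r
library(dplyr)
library(tidytext)

peng_preface_df <- tibble(line = 1, text = "I started using R in 1998.")

# One row per word; unnest_tokens lowercases and strips punctuation by default
peng_preface_df %>%
  unnest_tokens(output = word, input = text)
```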

## # A tibble: 6 x 2
##    line word   
##   <int> <chr>  
## 1     1 i      
## 2     1 started
## 3     1 using  
## 4     1 r      
## 5     1 in     
## 6     1 1998
## # A tibble: 6 x 2
##    line word      
##   <int> <chr>     
## 1     7 my        
## 2     7 first     
## 3     7 real      
## 4     7 data      
## 5     7 science   
## 6     7 experience

The argument token="words" defines the unit for tokenization. The default is “words”, but there are lots of other options.

For example, we could tokenize by “characters”:
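A sketch of character-level tokenization on the same stand-in data frame:

```r
library(dplyr)
library(tidytext)

peng_preface_df <- tibble(line = 1, text = "I started using R in 1998.")

# token = "characters" splits the text into single characters,
# dropping spaces and punctuation
peng_preface_df %>%
  unnest_tokens(word, text, token = "characters")
```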

## # A tibble: 6 x 2
##    line word 
##   <int> <chr>
## 1     1 i    
## 2     1 s    
## 3     1 t    
## 4     1 a    
## 5     1 r    
## 6     1 t

or something called “ngrams”, which is defined by Wikipedia as a “contiguous sequence of n items from a given sample of text or speech”
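A sketch of n-gram tokenization (here trigrams, n = 3) on the stand-in data frame:

```r
library(dplyr)
library(tidytext)

peng_preface_df <- tibble(line = 1, text = "I started using R in 1998.")

# Overlapping three-word sequences (trigrams)
peng_preface_df %>%
  unnest_tokens(word, text, token = "ngrams", n = 3)
```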

## # A tibble: 6 x 2
##    line word           
##   <int> <chr>          
## 1     1 i started using
## 2     1 started using r
## 3     1 using r in     
## 4     1 r in 1998      
## 5     1 in 1998 when   
## 6     1 1998 when i

Another option is to use character_shingles, which is similar to tokenizing by ngrams, except the units are characters instead of words.
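A sketch with four-character shingles (n = 4, matching the output below):

```r
library(dplyr)
library(tidytext)

peng_preface_df <- tibble(line = 1, text = "I started using R in 1998.")

# Overlapping four-character sequences; spaces and punctuation are dropped
peng_preface_df %>%
  unnest_tokens(word, text, token = "character_shingles", n = 4)
```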

## # A tibble: 6 x 2
##    line word 
##   <int> <chr>
## 1     1 ista 
## 2     1 star 
## 3     1 tart 
## 4     1 arte 
## 5     1 rted 
## 6     1 tedu

You can also create custom functions for tokenization.
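Any function that maps a character vector to a list of token vectors can be used as the tokenizer. For example, a sketch that simply splits on spaces with stringr:

```r
library(dplyr)
library(stringr)
library(tidytext)

peng_preface_df <- tibble(line = 1, text = "I started using R in 1998.")

# Pass a function as token =; extra arguments (pattern) are forwarded to it
peng_preface_df %>%
  unnest_tokens(word, text, token = str_split, pattern = " ")
```

Note that a bare space-splitting tokenizer keeps trailing punctuation attached to words, unlike the default "words" tokenizer.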

## # A tibble: 6 x 2
##    line word   
##   <int> <chr>  
## 1     1 i      
## 2     1 started
## 3     1 using  
## 4     1 r      
## 5     1 in     
## 6     1 1998

Example: text from works of Jane Austen

We will use the text from six published novels from Jane Austen, which are available in the janeaustenr R package. The authors describe the format:

"The package provides the text in a one-row-per-line format, where a line in this context is analogous to a literal printed line in a physical book.

The package contains:

  • sensesensibility: Sense and Sensibility, published in 1811
  • prideprejudice: Pride and Prejudice, published in 1813
  • mansfieldpark: Mansfield Park, published in 1814
  • emma: Emma, published in 1815
  • northangerabbey: Northanger Abbey, published posthumously in 1818
  • persuasion: Persuasion, also published posthumously in 1818

There is also a function austen_books() that returns a tidy data frame of all 6 novels."

Let’s load in the text from prideprejudice and look at how the data is stored.
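The output below comes from printing the first elements of the prideprejudice character vector:

```r
library(janeaustenr)

# Each element is one printed line of the book
head(prideprejudice, 20)
```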

##  [1] "PRIDE AND PREJUDICE"                                                        
##  [2] ""                                                                           
##  [3] "By Jane Austen"                                                             
##  [4] ""                                                                           
##  [5] ""                                                                           
##  [6] ""                                                                           
##  [7] "Chapter 1"                                                                  
##  [8] ""                                                                           
##  [9] ""                                                                           
## [10] "It is a truth universally acknowledged, that a single man in possession"    
## [11] "of a good fortune, must be in want of a wife."                              
## [12] ""                                                                           
## [13] "However little known the feelings or views of such a man may be on his"     
## [14] "first entering a neighbourhood, this truth is so well fixed in the minds"   
## [15] "of the surrounding families, that he is considered the rightful property"   
## [16] "of some one or other of their daughters."                                   
## [17] ""                                                                           
## [18] "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you heard that"
## [19] "Netherfield Park is let at last?\""                                         
## [20] ""

We see the text is stored in a character vector, with one line per element and elements of about 70 characters each.

Similar to what we did above with Roger’s preface, we can turn the character vector into a data frame and then convert it into a one-token-per-row data frame using the unnest_tokens() function.
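A sketch of that conversion (variable names assumed):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

prideprejudice_df <- tibble(text = prideprejudice)

words_df <- prideprejudice_df %>%
  unnest_tokens(word, text)
words_df
```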

## # A tibble: 122,204 x 1
##    word     
##    <chr>    
##  1 pride    
##  2 and      
##  3 prejudice
##  4 by       
##  5 jane     
##  6 austen   
##  7 chapter  
##  8 1        
##  9 it       
## 10 is       
## # ... with 122,194 more rows

We can also divide it by paragraphs:
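One way to get the paragraph tokenization shown below is to collapse the one-line-per-row text into a single string first, so that blank lines mark the paragraph breaks (a sketch; the lecture's own chunk may have differed):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Collapse the lines into one string; paragraphs are separated by "\n\n"
pp_paragraphs <- tibble(text = paste(prideprejudice, collapse = "\n")) %>%
  unnest_tokens(paragraph, text, token = "paragraphs")
pp_paragraphs
```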

## # A tibble: 2,126 x 1
##    paragraph                                                               
##    <chr>                                                                   
##  1 pride and prejudice                                                     
##  2 by jane austen                                                          
##  3 chapter 1                                                               
##  4 " it is a truth universally acknowledged, that a single man in possessi…
##  5 however little known the feelings or views of such a man may be on his …
##  6 "\"my dear mr. bennet,\" said his lady to him one day, \"have you heard…
##  7 mr. bennet replied that he had not.                                     
##  8 "\"but it is,\" returned she; \"for mrs. long has just been here, and s…
##  9 mr. bennet made no answer.                                              
## 10 "\"do you not want to know who has taken it?\" cried his wife impatient…
## # ... with 2,116 more rows

Note: what you name the output column (paragraph in this case) doesn’t affect the tokenization; it’s just good to give it an informative, consistent name.

We could also divide it by sentence:
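A sketch of sentence tokenization, again collapsing the lines into one string first:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# Collapse lines with spaces, then split at sentence boundaries
pp_sentences <- tibble(text = paste(prideprejudice, collapse = " ")) %>%
  unnest_tokens(sentence, text, token = "sentences")
pp_sentences
```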

## # A tibble: 7,066 x 1
##    sentence                                                                
##    <chr>                                                                   
##  1 pride and prejudice  by jane austen    chapter 1   it is a truth univer…
##  2 however little known the feelings or views of such a man may be on his …
##  3 "\"my dear mr."                                                         
##  4 "bennet,\" said his lady to him one day, \"have you heard that netherfi…
##  5 mr.                                                                     
##  6 bennet replied that he had not.                                         
##  7 "\"but it is,\" returned she; \"for mrs."                               
##  8 "long has just been here, and she told me all about it.\""              
##  9 mr.                                                                     
## 10 bennet made no answer.                                                  
## # ... with 7,056 more rows

Note: the sentence tokenizer is tricked by abbreviations like “Mr.” and “Mrs.”

One neat trick is that we can unnest by two layers: paragraph and then word. This lets us keep track of which paragraph is which.
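A sketch of the two-layer unnesting: first into numbered paragraphs, then into words (this reproduces both outputs below):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

# First unnest into paragraphs, numbering each one ...
pp_by_paragraph <- tibble(text = paste(prideprejudice, collapse = "\n")) %>%
  unnest_tokens(paragraph, text, token = "paragraphs") %>%
  mutate(paragraph_number = row_number())

# ... then unnest each paragraph into words, keeping the paragraph number
paragraph_words <- pp_by_paragraph %>%
  unnest_tokens(word, paragraph)
paragraph_words
```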

## # A tibble: 2,126 x 2
##    paragraph                                              paragraph_number
##    <chr>                                                             <int>
##  1 pride and prejudice                                                   1
##  2 by jane austen                                                        2
##  3 chapter 1                                                             3
##  4 " it is a truth universally acknowledged, that a sing…                4
##  5 however little known the feelings or views of such a …                5
##  6 "\"my dear mr. bennet,\" said his lady to him one day…                6
##  7 mr. bennet replied that he had not.                                   7
##  8 "\"but it is,\" returned she; \"for mrs. long has jus…                8
##  9 mr. bennet made no answer.                                            9
## 10 "\"do you not want to know who has taken it?\" cried …               10
## # ... with 2,116 more rows

Note: We use mutate() to annotate a paragraph_number quantity to keep track of paragraphs in the original format.

## # A tibble: 122,204 x 2
##    paragraph_number word     
##               <int> <chr>    
##  1                1 pride    
##  2                1 and      
##  3                1 prejudice
##  4                2 by       
##  5                2 jane     
##  6                2 austen   
##  7                3 chapter  
##  8                3 1        
##  9                4 it       
## 10                4 is       
## # ... with 122,194 more rows

We notice there are many so-called stop words (“the”, “of”, “to”, and so forth in English). In text analysis, we often want to remove stop words, since they are not useful for the analysis. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
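A tiny sketch of how anti_join() filters stop words:

```r
library(dplyr)
library(tidytext)

d <- tibble(word = c("the", "truth", "of", "universally"))

# anti_join() keeps only rows whose word is NOT in stop_words
d %>%
  anti_join(stop_words, by = "word")
```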

## Joining, by = "word"
## # A tibble: 37,246 x 2
##    paragraph_number word        
##               <int> <chr>       
##  1                1 pride       
##  2                1 prejudice   
##  3                2 jane        
##  4                2 austen      
##  5                3 chapter     
##  6                3 1           
##  7                4 truth       
##  8                4 universally 
##  9                4 acknowledged
## 10                4 single      
## # ... with 37,236 more rows

Because we have stored our data in a tidy dataset, we can use tidyverse packages for exploratory data analysis.

For example, here we use dplyr’s count() function to find the most common words in the book
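A sketch of that pipeline, rebuilt from the raw text so the chunk is self-contained:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

tibble(text = prideprejudice) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%   # most frequent words first
  head()
```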

## # A tibble: 6 x 2
##   word          n
##   <chr>     <int>
## 1 elizabeth   597
## 2 darcy       373
## 3 bennet      294
## 4 miss        283
## 5 jane        264
## 6 bingley     257

Then use ggplot2 to plot the most commonly used words from the book.
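A sketch of such a plot (the n > 150 cutoff is an assumption):

```r
library(dplyr)
library(ggplot2)
library(tidytext)
library(janeaustenr)

word_counts <- tibble(text = prideprejudice) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

p <- word_counts %>%
  filter(n > 150) %>%                 # assumed cutoff for readability
  mutate(word = reorder(word, n)) %>% # order bars by frequency
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "count")
```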

## # A tibble: 6,009 x 2
##    word          n
##    <chr>     <int>
##  1 elizabeth   597
##  2 darcy       373
##  3 bennet      294
##  4 miss        283
##  5 jane        264
##  6 bingley     257
##  7 time        203
##  8 lady        183
##  9 sister      180
## 10 wickham     162
## # ... with 5,999 more rows

We can also do this for all of her books using the austen_books() object

## # A tibble: 6 x 2
##   text                  book               
##   <chr>                 <fct>              
## 1 SENSE AND SENSIBILITY Sense & Sensibility
## 2 ""                    Sense & Sensibility
## 3 by Jane Austen        Sense & Sensibility
## 4 ""                    Sense & Sensibility
## 5 (1811)                Sense & Sensibility
## 6 ""                    Sense & Sensibility

We can do some data wrangling that keeps track of the line number and chapter (using a regex) to find where all the chapters are.
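The standard recipe (from the tidytext book) numbers the lines within each book and takes a cumulative sum over a chapter-heading regex:

```r
library(dplyr)
library(stringr)
library(janeaustenr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         # cumsum() increments at each line matching "Chapter <number>"
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
original_books
```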

## # A tibble: 73,422 x 4
##    text                  book                linenumber chapter
##    <chr>                 <fct>                    <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2 ""                    Sense & Sensibility          2       0
##  3 by Jane Austen        Sense & Sensibility          3       0
##  4 ""                    Sense & Sensibility          4       0
##  5 (1811)                Sense & Sensibility          5       0
##  6 ""                    Sense & Sensibility          6       0
##  7 ""                    Sense & Sensibility          7       0
##  8 ""                    Sense & Sensibility          8       0
##  9 ""                    Sense & Sensibility          9       0
## 10 CHAPTER 1             Sense & Sensibility         10       1
## # ... with 73,412 more rows

Finally we can restructure it to a one-token-per-row format using the unnest_tokens() function and remove stop words.
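A sketch of the full pipeline, from austen_books() to the tidy, stop-word-free table:

```r
library(dplyr)
library(stringr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%        # one word per row
  anti_join(stop_words, by = "word")   # drop stop words
tidy_books
```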

## Joining, by = "word"
## # A tibble: 217,609 x 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 sensibility
##  3 Sense & Sensibility          3       0 jane       
##  4 Sense & Sensibility          3       0 austen     
##  5 Sense & Sensibility          5       0 1811       
##  6 Sense & Sensibility         10       1 chapter    
##  7 Sense & Sensibility         10       1 1          
##  8 Sense & Sensibility         13       1 family     
##  9 Sense & Sensibility         13       1 dashwood   
## 10 Sense & Sensibility         13       1 settled    
## # ... with 217,599 more rows

Here are the most commonly used words across all of Jane Austen’s books.

Sentiment Analysis

In the previous section, we explored the tidy text format and showed how we can calculate things such as word frequency.

Next, we are going to look at something called opinion mining or sentiment analysis. The tidytext authors write:

“When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use the tools of text mining to approach the emotional content of text programmatically, as shown in the figure below”

A flowchart of a typical text analysis that uses tidytext for sentiment analysis.


image source

“One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.”

Let’s try using sentiment analysis on the Jane Austen books.

The sentiments dataset

Inside the tidytext package are several sentiment lexicons:

## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

These are the types of lexicons that are included.

## 
##    AFINN     bing loughran      nrc 
##     2476     6788     4149    13901

A few things to note:

  • The lexicons are based on unigrams (single words)
  • The lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth

You can use the get_sentiments() function to extract a specific lexicon.
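For example (note: in newer tidytext releases only the bing lexicon ships with the package, and the nrc and afinn lexicons are downloaded on first use via the textdata package, so the combined sentiments table shown above comes from an older release):

```r
library(tidytext)

get_sentiments("bing")    # bundled with tidytext
# Fetched via the textdata package in newer tidytext releases:
# get_sentiments("nrc")
# get_sentiments("afinn")
```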

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust

## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

The bing lexicon categorizes words in a binary fashion into positive and negative categories

## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows

The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment

## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows

The authors of the tidytext package note:

“How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text.”

Two other caveats:

“Not every English word is in the lexicons because many English words are pretty neutral. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only. For many kinds of text (like the narrative examples below), there are not sustained sections of sarcasm or negated text, so this is not an important effect. Also, we can use a tidy text approach to begin to understand what kinds of negation words are important in a given text; see Chapter 9 for an extended example of such an analysis.”

and

“One last caveat is that the size of the chunk of text that we use to add up unigram sentiment scores can have an effect on an analysis. A text the size of many paragraphs can often have positive and negative sentiment averaged out to about zero, while sentence-sized or paragraph-sized text often works better.”

Joining together tidy text data with lexicons

Now that we have our data in a tidy text format and we have learned about different types of lexicons in application for sentiment analysis, we can join the words together using an inner join function.

For example, what are the most common joy words in the book Emma? Here we will use the nrc lexicon and join the tidy_books dataset with the nrc_joy lexicon using the inner_join() function.
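A sketch of that join (following the tidytext book), with the tidy_books table rebuilt so the chunk is self-contained; note again that newer tidytext releases fetch the nrc lexicon via the textdata package on first use:

```r
library(dplyr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)
```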

## Joining, by = "word"
## # A tibble: 298 x 2
##    word          n
##    <chr>     <int>
##  1 friend      166
##  2 hope        143
##  3 happy       125
##  4 love        117
##  5 deal         92
##  6 found        92
##  7 happiness    76
##  8 pretty       68
##  9 true         66
## 10 comfort      65
## # ... with 288 more rows

We can do things like investigate how the sentiment of the text changes throughout each of Jane Austen’s novels.

Here we will use the bing lexicon, find a sentiment score for each word, and then use inner_join().
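The bing join can be sketched with just a few words; neutral words (like "table" here) simply drop out of an inner join:

```r
library(dplyr)
library(tidytext)

bing <- get_sentiments("bing")

# "death" is negative, "goodness" is positive, "table" is not in the lexicon
tibble(word = c("death", "goodness", "table")) %>%
  inner_join(bing, by = "word")
```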

## Joining, by = "word"
## # A tibble: 6 x 5
##   book                linenumber chapter word        sentiment
##   <fct>                    <int>   <int> <chr>       <chr>    
## 1 Sense & Sensibility         16       1 respectable positive 
## 2 Sense & Sensibility         18       1 advanced    positive 
## 3 Sense & Sensibility         20       1 death       negative 
## 4 Sense & Sensibility         21       1 loss        negative 
## 5 Sense & Sensibility         25       1 comfortably positive 
## 6 Sense & Sensibility         28       1 goodness    positive

Then we can count how many positive and negative words there are in each section of the books. We create an index to keep track of where we are in the narrative; it uses integer division to count up sections of 80 lines of text.
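The counting step can be sketched with toy data; on the real joined table the call is count(book, index = linenumber %/% 80, sentiment):

```r
library(dplyr)

# Toy stand-in for the joined book/linenumber/sentiment table
d <- tibble(book = "Sense & Sensibility",
            linenumber = c(10, 50, 100, 150),
            sentiment = c("positive", "negative", "positive", "positive"))

# index groups the text into 80-line sections
d %>%
  count(book, index = linenumber %/% 80, sentiment)
```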

## Joining, by = "word"
## # A tibble: 1,840 x 4
##    book                index sentiment     n
##    <fct>               <dbl> <chr>     <int>
##  1 Sense & Sensibility     0 negative     16
##  2 Sense & Sensibility     0 positive     26
##  3 Sense & Sensibility     1 negative     19
##  4 Sense & Sensibility     1 positive     44
##  5 Sense & Sensibility     2 negative     12
##  6 Sense & Sensibility     2 positive     23
##  7 Sense & Sensibility     3 negative     15
##  8 Sense & Sensibility     3 positive     22
##  9 Sense & Sensibility     4 negative     16
## 10 Sense & Sensibility     4 positive     29
## # ... with 1,830 more rows

Note: The %/% operator does integer division (x %/% y is equivalent to floor(x/y)) so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in.
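A quick base-R illustration of how %/% 80 assigns lines to sections:

```r
# %/% is integer division: floor(x / y)
c(0, 79, 80, 159, 160) %/% 80
#> [1] 0 0 1 1 2
```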

Finally, we use spread() to have positive and negative counts in different columns, and then use mutate() to calculate a net sentiment (positive - negative).
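A sketch of the spread-and-subtract step on toy counts (one 80-line section):

```r
library(dplyr)
library(tidyr)

d <- tibble(book = "Sense & Sensibility",
            index = 0,
            sentiment = c("negative", "positive"),
            n = c(16, 26))

d %>%
  spread(sentiment, n, fill = 0) %>%          # one column per sentiment
  mutate(sentiment = positive - negative)     # net sentiment score
```

(In current tidyr, pivot_wider() supersedes spread(), but spread() still works.)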

## Joining, by = "word"
## # A tibble: 920 x 5
##    book                index negative positive sentiment
##    <fct>               <dbl>    <dbl>    <dbl>     <dbl>
##  1 Sense & Sensibility     0       16       26        10
##  2 Sense & Sensibility     1       19       44        25
##  3 Sense & Sensibility     2       12       23        11
##  4 Sense & Sensibility     3       15       22         7
##  5 Sense & Sensibility     4       16       29        13
##  6 Sense & Sensibility     5       16       39        23
##  7 Sense & Sensibility     6       24       37        13
##  8 Sense & Sensibility     7       22       39        17
##  9 Sense & Sensibility     8       30       35         5
## 10 Sense & Sensibility     9       14       18         4
## # ... with 910 more rows

Then we can plot the sentiment scores across the sections of each novel:
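A sketch of the plotting code, using toy data in place of the computed book/index/sentiment table:

```r
library(ggplot2)

# Toy stand-in for the per-section net sentiment table computed above
jas <- data.frame(book = "Sense & Sensibility",
                  index = 0:4,
                  sentiment = c(10, 25, 11, 7, 13))

p <- ggplot(jas, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free_x")  # one panel per novel
```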

We can see how the sentiment trajectory of the novel changes over time.

Word clouds

You can also do things like create word clouds using the wordcloud package.
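A sketch of such a word cloud from the Pride and Prejudice word counts (the max.words cutoff is an assumption):

```r
library(dplyr)
library(tidytext)
library(janeaustenr)
library(wordcloud)

tibble(text = prideprejudice) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))  # show the 100 most common words
```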

## Loading required package: RColorBrewer
## Joining, by = "word"